Abstract

An automated technique has recently been proposed to transfer learning in the hierarchical 
Bayesian optimization algorithm (hBOA) based on distance-based statistics. The technique en- 
ables practitioners to improve hBOA efficiency by collecting statistics from probabilistic models 
obtained in previous hBOA runs and using the obtained statistics to bias future hBOA runs 
on similar problems. The purpose of this paper is threefold: (1) test the technique on several 
classes of NP-complete problems, including MAXSAT, spin glasses and minimum vertex cover;
(2) demonstrate that the technique is effective even when previous runs were done on problems 
of different size; (3) provide empirical evidence that combining transfer learning with other 
efficiency enhancement techniques can often yield nearly multiplicative speedups. 

Keywords: Transfer learning, inductive transfer, learning from experience, estimation of distri- 
bution algorithms, hierarchical Bayesian optimization algorithm, decomposable problems, efficiency 
enhancement. 



1 Introduction 

Estimation of distribution algorithms (EDAs) [U EJ [Sj S] guide the search for the optimum by 
building and sampling probabilistic models of candidate solutions. The use of probabilistic models 
in EDAs provides a basis for incorporating prior knowledge about the problem and learning from 
previous runs in order to solve new problem instances of similar type with increased speed, accuracy 
and reliability OE]. However, much prior work in this area was based on hand-crafted constraints 
on probabilistic models [Tj [8l [U [10] which may be difficult to design or even detrimental to EDA 
efficiency and scalability [11]. Recently, Pelikan and Hauschild [12] proposed an automated technique
capable of learning from previous runs of the hierarchical Bayesian optimization algorithm
(hBOA) in order to improve efficiency of future hBOA runs on problems of similar type. The 
basic idea of the approach was to (1) design a distance metric on problem variables that correlates 
with the expected strength of dependencies between the variables, (2) collect statistics on hBOA 
models with respect to the values of the distance metric, and (3) use the collected statistics to bias 
model building in hBOA when solving future problem instances of similar type. While the distance 
metric is strongly related to the problem being solved, the aforementioned study [12] described a 
rather general metric that can be applied to practically any problem with the objective function 
represented by an additively decomposable function. However, the prior study [12] evaluated the
proposed technique on only two classes of problems and it did not demonstrate several key features 
of this technique. 

The purpose of this paper is threefold: (1) Demonstrate the technique from ref. [12] on other 
classes of challenging optimization problems, (2) demonstrate the ability of this technique to learn 
from problem instances of one size in order to introduce bias for instances of another size, and 
(3) demonstrate the potential benefits of combining this technique with other efficiency enhance- 
ment techniques, such as sporadic model building [13]. As test problems the paper considers several 
classes of NP-complete additively decomposable problems, including MAXSAT, three-dimensional 
Ising spin glass, and minimum vertex cover. The new results together with the results published 
in prior work [12] provide strong evidence of the broad applicability and great potential of this 
technique for learning from experience (transfer learning) in EDAs. 

The paper is organized as follows. Section [2] outlines hBOA. Section [3] discusses efficiency 
enhancement of estimation of distribution algorithms using inductive transfer with main focus on 
hBOA and the distance-based bias [12]. Section [4] presents and discusses experimental results.
Section [5] summarizes and concludes the paper. 

2 Hierarchical BOA 

The hierarchical Bayesian optimization algorithm (hBOA) [SKH] works with a population of candi- 
date solutions represented by fixed-length strings over a finite alphabet. In this paper, candidate so- 
lutions are represented by n-bit binary strings. The initial population of binary strings is generated 
at random according to the uniform distribution over candidate solutions. Each iteration starts by 
selecting promising solutions from the current population; here binary tournament selection without 
replacement is used. Next, hBOA (1) learns a Bayesian network with local structures [15] for the 
selected solutions and (2) generates new candidate solutions by sampling the distribution encoded 
by the built network. To maintain useful diversity in the population, the new candidate solutions 
are incorporated into the original population using restricted tournament selection (RTS) [16]. The 
run is terminated when termination criteria are met. In this paper, each run is terminated either 
when the global optimum is found or when a maximum number of iterations is reached. 
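
As an illustration of the replacement step, the following Python sketch shows one common formulation of restricted tournament selection for a maximization problem. The function names, the `window` parameter, and the use of Hamming distance on bit strings are assumptions made for this example; they are not taken from the hBOA implementation.

```python
import random

def hamming(a, b):
    """Number of positions in which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def restricted_tournament_insert(population, fitness, candidate, window, rng):
    """Insert `candidate` via restricted tournament selection (maximization).

    Picks `window` random members, finds the one closest to the candidate
    in Hamming distance, and replaces it only if the candidate is fitter.
    Returns True when a replacement took place.
    """
    indices = rng.sample(range(len(population)), window)
    closest = min(indices, key=lambda i: hamming(population[i], candidate))
    if fitness(candidate) > fitness(population[closest]):
        population[closest] = candidate
        return True
    return False

# Toy usage with fitness equal to the number of ones in the string.
population = [(0, 0, 0, 0), (1, 1, 1, 0)]
replaced = restricted_tournament_insert(
    population, sum, (1, 1, 1, 1), window=2, rng=random.Random(0))
```

Because a new solution competes only against a similar existing one, distinct regions of the search space are less likely to be overwritten by a single dominant solution.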



hBOA represents probabilistic models of candidate solutions by Bayesian networks with local 
structures [15] [17]. A Bayesian network is defined by two components: (1) an acyclic directed graph
over problem variables specifying direct dependencies between variables and (2) conditional
probabilities specifying the probability distribution of each variable given the values of the
variable's parents. A Bayesian network encodes a joint probability distribution as

$$p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid \Pi_i),$$

where $X_i$ is the $i$th variable (string position) and $\Pi_i$ are the parents of $X_i$ in the underlying graph.
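
To make the factorization concrete, the following Python sketch evaluates the joint probability of a binary string under a small, hypothetical three-variable network. The structure and the probability values are invented for illustration only, and plain conditional probability tables are used in place of the local structures that hBOA employs.

```python
from itertools import product

# Hypothetical network: X0 has no parents, X1 depends on X0,
# X2 depends on X0 and X1. Each table maps parent values to P(X_i = 1 | parents).
parents = {0: (), 1: (0,), 2: (0, 1)}
p_one = {
    0: {(): 0.6},
    1: {(0,): 0.2, (1,): 0.9},
    2: {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.8},
}

def joint_probability(x):
    """Return p(x) as the product of p(x_i | values of the parents of x_i)."""
    prob = 1.0
    for i, pa in parents.items():
        p1 = p_one[i][tuple(x[j] for j in pa)]
        prob *= p1 if x[i] == 1 else 1.0 - p1
    return prob

# The probabilities of all 2^3 strings should sum to one.
total = sum(joint_probability(x) for x in product((0, 1), repeat=3))
```

Sampling a new candidate solution follows the same ordering: each variable is drawn after its parents, using the conditional probability selected by the parents' values.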

To represent conditional probabilities of each variable given the variable's parents, hBOA uses 
decision trees [14] [15]. Each internal node of a decision tree specifies a variable, and the subtrees
of the node correspond to the different values of the variable. Each leaf of the decision tree for a 
particular variable defines the probability distribution of the variable given a condition specified 
by the constraints given by the path from the root of the tree to this leaf (constraints are given by 
the assignments of the variables along this path).

To build probabilistic models, hBOA typically uses a greedy algorithm that initializes the 
decision tree for each problem variable Xi to a single-node tree that encodes the unconditional 
probability distribution of Xi. In each iteration, the model building algorithm tests how much a 
model would improve after splitting each leaf of each decision tree on each variable that is not 
already located on the path to the leaf. The algorithm executes the split that provides the most 
improvement, and the process is repeated until no more improvement is possible. Models are 
evaluated using the Bayesian-Dirichlet (BDe) metric with penalty for model complexity, which 
estimates the goodness of a Bayesian network structure given data D and background knowledge 
$\xi$ as $p(B \mid D, \xi) = c\, p(B \mid \xi)\, p(D \mid B, \xi)$, where $c$ is a normalization constant [15] [18]. The
Bayesian-Dirichlet metric estimates the term $p(D \mid B, \xi)$ by combining the observed and prior statistics for
relevant combinations of variables [15]. To favor simpler networks over more complex ones, the
prior probability $p(B \mid \xi)$ is often set to decrease exponentially fast with respect to the description
length of the network's parameters [5] [17].

3 Learning from Experience using Distance-Based Bias 

In hBOA and other EDAs based on complex probabilistic models, building an accurate probabilistic 
model is crucial to the success [2] [3] [11] [19]. However, building complex probabilistic models can be
time consuming and it may require rather large populations of solutions [2] [3]. That is why much
effort has been put into enhancing efficiency of model building in EDAs and improving quality of
EDA models even with smaller populations [6] [HI [9] [20] [21]. Learning from experience [5] [6] [12] [20]
[21] represents one approach to addressing this issue.

The basic idea of learning from experience is to gather information about the problem by 
examining previous runs of the optimization algorithm and to use the obtained information to bias 
the search on new problem instances. The use of bias based on the results of other learning tasks 
is also commonplace in machine learning where it is referred to as inductive transfer or transfer 
learning [22] [23]. Since learning model structure is often the most computationally expensive task in
model building, learning from experience often focuses on identifying regularities in model structure 
and using these regularities to bias structural learning in future runs. 

Analyzing probabilistic models built by hBOA and other EDAs is straightforward. The more 
challenging facet of implementing learning from experience in practice is that one must make sure 
that the collected statistics are meaningful with respect to the problem being solved. The key to
making learning from experience work is to ensure that the pairs of variables are classified into a
set of categories so that the pairs in each category have a lot in common and can be expected to be
either correlated or independent simultaneously [12]. This section describes one approach to doing
that [12], in which pairs of variables are classified into categories based on a predefined distance
metric on variables.

3.1 Distance Metric for Additively Decomposable Functions 

For many optimization problems, the objective function (fitness function) can be expressed as an 
additively decomposable function (ADF): 

$$f(X_1, \ldots, X_n) = \sum_{i=1}^{m} f_i(S_i), \qquad (1)$$

where $(X_1, \ldots, X_n)$ are the problem's decision variables (string positions), $f_i$ is the $i$th subfunction,
and $S_i \subset \{X_1, X_2, \ldots, X_n\}$ is the subset of variables contributing to $f_i$. While there may often
exist multiple ways of decomposing the problem using additive decomposition, one would typically 
prefer decompositions that minimize the sizes of subsets {Si}. Note that the difficulty of ADFs is 
not fully determined by the order of subproblems, but also by the definition of the subproblems and 
their interaction; even with subproblems of order only 2 or 3, the problem can be NP-complete. 

The definition of a distance between two variables of an ADF used in this paper as well as
ref. [12] follows the work of Hauschild et al. [6] [11] [20]. Given an ADF, we define the distance
between two variables using a graph $G$ of $n$ nodes, one node per variable. For any two variables $X_i$
and $X_j$ in the same subset $S_k$, we create an edge in $G$ between the nodes $X_i$ and $X_j$. Denoting by
$l_{i,j}$ the number of edges along the shortest path between $X_i$ and $X_j$ in $G$, we define the
distance between two variables as

$$D(X_i, X_j) = \begin{cases} l_{i,j} & \text{if a path between } X_i \text{ and } X_j \text{ exists,} \\ n & \text{otherwise.} \end{cases}$$

The above distance measure makes variables in the same subproblem close to each other, whereas 
for the remaining variables, the distances correspond to the length of the chain of subproblems that 
relate the two variables. The distance is maximal for variables that are completely independent 
(the value of a variable does not influence the contribution of the other variable in any way).
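
The distance defined above can be computed with one breadth-first search per variable. The following Python sketch is a minimal illustration under the assumption that the decomposition is given as a list of index tuples, one per subfunction; the function name `adf_distances` and this input format are choices made for the example.

```python
from collections import deque
from itertools import combinations

def adf_distances(n, subsets):
    """Return an n x n matrix of distances D(X_i, X_j) for an ADF.

    `subsets` lists, for each subfunction, the indices of its variables.
    Pairs with no connecting path get distance n, as in the definition.
    """
    adjacency = [set() for _ in range(n)]
    for subset in subsets:
        for a, b in combinations(subset, 2):
            adjacency[a].add(b)
            adjacency[b].add(a)

    # A shortest path has at most n - 1 edges, so n doubles as "not reached".
    dist = [[n] * n for _ in range(n)]
    for source in range(n):
        dist[source][source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[source][v] == n:
                    dist[source][v] = dist[source][u] + 1
                    queue.append(v)
    return dist

# Two overlapping subproblems {X0, X1, X2} and {X2, X3}; X4 is isolated.
d = adf_distances(5, [(0, 1, 2), (2, 3)])
```

In this toy example, variables sharing a subproblem are at distance 1, X0 and X3 are at distance 2 because they are linked through X2, and the isolated variable X4 is at the maximal distance 5 from every other variable.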

Since interactions between problem variables are encoded mainly in the subproblems of the 
additive problem decomposition, the above distance metric should typically correspond closely to 
the likelihood of dependencies between problem variables in probabilistic models discovered by 
EDAs. Specifically, the variables located closer with respect to the metric should more likely 
interact with each other. This observation has been confirmed with numerous experimental studies 
across a number of important problem domains from spin glasses distributed on a finite-dimensional 
lattice dUIl] to NK landscapes [H]. 



3.2 Distance-Based Bias Based on Previous Runs of hBOA 

This section describes the approach to learning from experience developed by Pelikan and 
Hauschild [12] inspired mainly by the work of Hauschild et al. [U [20] [21]. Let us assume a set
$M$ of hBOA models from prior hBOA runs on similar problems. Before applying the bias based
on prior runs in hBOA, the models in $M$ are first processed to generate data that will serve as the
basis for introducing the bias. The processing starts by analyzing the models in $M$ to determine
the number $s(m, d, j)$ of splits on any variable $X_i$ such that $D(X_i, X_j) = d$ in a decision tree $T_j$ for
variable $X_j$ in a model $m \in M$. Then, the values $s(m, d, j)$ are used to compute the probability
$P_k(d, j)$ of a $k$th split on a variable at distance $d$ from $X_j$ in the decision tree $T_j$ given that $k - 1$
such splits were already performed in $T_j$:

$$P_k(d, j) = \frac{|\{m \in M : s(m, d, j) \geq k\}|}{|\{m \in M : s(m, d, j) \geq k - 1\}|}. \qquad (2)$$
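
The estimate of the split probabilities can be sketched in a few lines of Python. The example assumes that the comparisons in the ratio are read as "at least k" and "at least k - 1" splits, and that the counts s(m, d, j) for one fixed pair (d, j) have already been extracted from the models; the function name and the list-based input are hypothetical.

```python
def split_probabilities(split_counts, max_k):
    """Estimate P_k for k = 1..max_k from per-model split counts.

    `split_counts` holds s(m, d, j) for every model m in M, for one fixed
    pair (d, j). P_k is the fraction of models with at least k - 1 such
    splits that also have at least k of them.
    """
    probabilities = {}
    for k in range(1, max_k + 1):
        at_least_k = sum(1 for s in split_counts if s >= k)
        at_least_k_minus_1 = sum(1 for s in split_counts if s >= k - 1)
        if at_least_k_minus_1 == 0:
            break  # no model reached k - 1 splits, so P_k is undefined
        probabilities[k] = at_least_k / at_least_k_minus_1
    return probabilities

# Four models with 0, 1, 1 and 2 splits at the given distance.
p = split_probabilities([0, 1, 1, 2], max_k=4)
```

For these counts, three of the four models contain a first split, one of those three contains a second split, and none contains a third, so the estimates for k = 1, 2, 3 are 3/4, 1/3 and 0.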

Recall that the BDe metric for evaluating the quality of probabilistic models in hBOA contains two
parts: (1) the prior probability $p(B \mid \xi)$ of the network structure $B$, and (2) the posterior probability
$p(D \mid B, \xi)$ of the data (population of selected solutions) given $B$. Pelikan and Hauschild [12]
proposed to use the prior probability distribution $p(B \mid \xi)$ to introduce a bias based on distance-based
statistics from previous hBOA runs represented by $P_k(d, j)$ by setting

$$p(B \mid \xi) = c \prod_{d=1}^{n} \prod_{j=1}^{n} \prod_{k=1}^{n_s(d,j)} P_k^{\kappa}(d, j), \qquad (3)$$

where $n_s(d, j)$ denotes the number of splits on any variable $X_i$ in $T_j$ such that $D(X_i, X_j) = d$, $\kappa > 0$
is used to tune the strength of bias (the strength of bias increases with $\kappa$), and $c$ is a normalization
constant. Since log-likelihood is typically used to evaluate model quality, the change in the prior
probability of the network structure caused by any particular split can still be computed in
constant time.
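
The constant-time update follows from taking the logarithm of the product: each split contributes one additive term. The Python sketch below illustrates this under assumed data structures (dictionaries keyed by the pair (d, j)); the strength exponent is written `kappa` to keep it apart from the split index k, and the normalization constant is omitted because it does not affect comparisons between structures.

```python
import math

def log_prior(num_splits, P, kappa):
    """Unnormalized log prior: kappa times the sum of log P_k(d, j).

    `num_splits[(d, j)]` is the number of splits n_s(d, j);
    `P[(d, j)][k]` is the estimated probability P_k(d, j).
    """
    total = 0.0
    for (d, j), ns in num_splits.items():
        for k in range(1, ns + 1):
            total += kappa * math.log(P[(d, j)][k])
    return total

def log_prior_gain(d, j, current_splits, P, kappa):
    """Change in the log prior from adding one more split at distance d in T_j."""
    return kappa * math.log(P[(d, j)][current_splits + 1])

# Hypothetical statistics for distance 1 in the tree of variable 0.
P = {(1, 0): {1: 0.5, 2: 0.25}}
before = log_prior({(1, 0): 1}, P, kappa=3)
after = log_prior({(1, 0): 2}, P, kappa=3)
gain = log_prior_gain(1, 0, current_splits=1, P=P, kappa=3)
```

The incremental gain equals the difference between the two full evaluations, so the greedy model builder only needs a single table lookup per candidate split. A probability of exactly zero would make the logarithm undefined; how such cases are handled is not specified here, and the sketch does not address it.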

4 Experiments 

4.1 Test Problems and Experimental Setup 

The experiments were done for three problem classes known to be difficult for most genetic and
evolutionary algorithms: (1) Three-dimensional Ising spin glasses were considered with $\pm J$
couplings and periodic boundary conditions [24] [25]; two problem sizes were used, $n = 6 \times 6 \times 6 = 216$
spins and $n = 7 \times 7 \times 7 = 343$ spins, with 1,000 unique problem instances for each $n$. (2) Minimum
vertex cover was considered for random graphs with a fixed ratio $c$ of the number of edges to the number
of nodes [26] [27]; two ratios ($c = 2$ and $c = 4$) and two problem sizes ($n = 150$ and $n = 200$)
were used, with 1,000 unique problem instances for each combination of $c$ and $n$. (3) MAXSAT
was considered for mapped instances of graph coloring with graphs created by combining regular
ring lattices (with probability $1 - p$) and random graphs (with probability $p$) [28] [29]; 100 unique
problem instances of $n = 500$ bits (propositions) were used for each considered value of $p$, from
$p = 2^{-8}$ (graphs nearly identical to a regular ring lattice) to $p = 2^{-1}$ (graphs with half of the edges
random). For more information about the test problems, we refer the reader to refs. [24] [26] [28].

The maximum number of iterations for each problem instance was set to the number of bits in 
the problem; according to preliminary experiments, this upper bound was sufficient. Each run was 
terminated either when the global optimum was found, when the population consisted of copies 
of a single candidate solution, or when the maximum number of iterations was reached. For each 
problem instance, we used bisection [5l [3^ to ensure that the population size was within 5% of 
the minimum population size needed to find the optimum in 10 out of 10 independent runs. Bit-flip hill
climbing (HC) [5] was incorporated into hBOA to improve its performance on all test problems
except for the minimum vertex cover; HC was used to improve every solution in the population. For
minimum vertex cover, a repair operator based on ref. [26] was incorporated instead. The strength
of the distance-based bias was tweaked using $\kappa \in \{1, 3, 5, 7, 9\}$.
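
The population-sizing step can be sketched as follows. The details of the bisection procedure are not given in this paper, so the doubling phase, the stopping rule, and the callback `succeeds` (which would wrap the 10 independent hBOA runs) are assumptions made for illustration; the sketch also assumes that success is monotone in the population size.

```python
def bisection_population_size(succeeds, start=10, tolerance=0.05):
    """Find a population size within `tolerance` of the smallest successful one.

    `succeeds(N)` is a hypothetical callback that returns True when the
    optimizer with population size N finds the optimum in all of its
    independent runs (10 out of 10 in the setup above).
    """
    low = high = start
    while not succeeds(high):          # doubling phase: find a sufficient size
        low, high = high, 2 * high
    # Invariant: `high` succeeds; `low` failed (unless `start` succeeded outright).
    while high - low > tolerance * low and high - low > 1:
        mid = (low + high) // 2
        if succeeds(mid):
            high = mid
        else:
            low = mid
    return high

# Deterministic stand-in for the optimizer: sizes of at least 137 succeed.
result = bisection_population_size(lambda size: size >= 137)
```

With real, stochastic runs the outcome of `succeeds` can vary between calls, so the returned size is only an estimate; if the starting size already succeeds, the sketch returns it without searching for a smaller one.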

To ensure that the same problem instances were not used for defining the bias as well as for 
testing it, 10-fold crossvalidation was used when evaluating the effects of distance-based bias derived 
from problem instances of the same size. For each set of problems (by a set of problems we mean a
set of random problem instances generated with one specific set of parameters), problem instances
were randomly split into 10 equally sized subsets. In each round of crossvalidation, 1 subset of
instances was left out and hBOA was run on the remaining 9 subsets of instances. The runs on
the 9 subsets produced models that were analyzed in order to obtain the probabilities $P_k(d, j)$ for
all $d$, $j$, and $k$. The bias based on the obtained values of $P_k(d, j)$ was then used in hBOA runs on
the remaining subset of instances. The same procedure was repeated for each subset; overall, 10 
rounds of crossvalidation were performed for each set of instances. When evaluating the effects of 
distance-based bias derived from problem instances of smaller size, we did not use crossvalidation 
because in this case all runs had to be done on different problem instances (of different size). Most 
importantly, in every experiment, models used to generate statistics for hBOA bias were obtained 
from hBOA runs on different problem instances. While the experiments were performed across 
a variety of computer architectures and configurations, the base case with no bias and the case 
with bias were always both run on the same computational node; the results of the two runs could 
therefore be compared against each other with respect to the actual CPU (execution) time. 
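
The crossvalidation protocol described above amounts to the following partitioning scheme, shown here as a Python sketch. The generator name, the fixed random seed, and the representation of instances as list elements are assumptions for the example; the hBOA runs and the model analysis themselves are not shown.

```python
import random

def crossvalidation_rounds(instances, folds=10, seed=0):
    """Yield (training, held_out) instance lists, one pair per round.

    Models would be collected from runs on `training` and the resulting
    bias applied to runs on `held_out`.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    subsets = [shuffled[i::folds] for i in range(folds)]
    for i, held_out in enumerate(subsets):
        training = [x for j, s in enumerate(subsets) if j != i for x in s]
        yield training, held_out

# 1,000 instance identifiers, as used for spin glasses and vertex cover.
rounds = list(crossvalidation_rounds(range(1000)))
```

Each instance appears in exactly one held-out subset, so no instance is ever solved with a bias derived from its own models.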

To evaluate hBOA performance, we focus on the multiplicative speedup with respect to the 
execution time per run; the speedup is defined as a multiplicative factor by which the execution 
time improves with the distance-based bias compared to the base case. For example, an execution- 
time speedup of 2 indicates that the bias allowed hBOA to find the optimum using only half the 
execution time compared to the base case without the bias. We also report the percentage of runs 
for which the execution time was strictly improved (shown in parentheses after the corresponding 
average multiplicative speedup). 
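
The two reported quantities can be computed from paired execution times as in the Python sketch below. The paper does not state how the average is formed, so the arithmetic mean of per-instance ratios used here is an assumption, as are the function name and the example timings.

```python
def speedup_summary(base_times, biased_times):
    """Return (average speedup, percentage of strictly improved instances).

    `base_times[i]` and `biased_times[i]` are execution times for the same
    problem instance without and with the distance-based bias.
    """
    speedups = [b / t for b, t in zip(base_times, biased_times)]
    improved = sum(1 for b, t in zip(base_times, biased_times) if t < b)
    average = sum(speedups) / len(speedups)
    return average, 100.0 * improved / len(speedups)

# Made-up timings (seconds): one instance improves, one ties, one gets worse.
average, percent_improved = speedup_summary([10.0, 8.0, 6.0], [5.0, 8.0, 12.0])
```

A tie does not count as an improvement, which matches the "strictly improved" criterion used for the percentages in parentheses.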

In addition to the speedups achieved for various values of $\kappa$, we examine the ability of the
distance-based bias based on prior runs to apply across a range of problem sizes; this is done by 
using previous runs on instances of one size to bias runs on instances of another size. Since for 
MAXSAT, we only used instances of one size, this facet was only examined for the other two 
problem classes. 

Finally, we examine the combination of the distance-based bias based on prior runs and the 
sporadic model building [13]. Specifically, we apply sporadic model building on its own using the 
model-building delay of $\sqrt{n}/2$ as suggested by ref. [13], and then we carry out a similar experiment
using both the distance-based bias as well as the sporadic model building, recording the speedups 
with respect to the base case. Ideally, we would expect the speedups from the two sources to 
multiply. Due to the time requirements of solving MAXSAT, the combined effects were studied 
only for the remaining two problem classes. 

4.2 Results 

The results presented in Tables [1], [2] and [3] confirm the observation from ref. [12] that the stronger the
bias the greater the benefits, at least for the examined range of $\kappa \in \{1, 3, 5, 7, 9\}$ and most problem
settings; that is why in the remainder of this discussion we focus on $\kappa = 9$. In all cases, the
distance-based bias yielded substantial speedups of about 1.2 to 3.1. The best speedups were obtained for
minimum vertex cover. In all cases, performance on at least about 70% of problem instances was
strictly improved in terms of execution time; in most cases, the improvements were observed in a
much greater majority of instances. The speedups were substantial even when the bias was based on
prior runs on problem instances of different, smaller size; in fact, the speedups obtained with such a
bias were nearly identical to the speedups with the bias based on the instances of the same size. The
results thus provide clear empirical evidence that the distance-based bias is applicable even when
the problem instances vary in size, which was argued [12] to be one of the main advantages of the
distance-based bias over prior work in the area but was not demonstrated. Finally, the results show
the nearly multiplicative effect of the distance-based bias and sporadic model building, providing
further support for the importance of the distance-based bias; the combined speedups ranged from
about 4 to more than 11.

Table 1: Results for 3D spin glass.

(a) Results for 10-fold crossvalidation with priors from other instances of the same size.

         CPU speedup
  κ      n = 216        n = 343
  1      0.40 ( 0%)     0.43 ( 0%)
  3      1.00 (43%)     1.08 (60%)
  5      1.23 (71%)     1.32 (85%)
  7      1.24 (70%)     1.34 (81%)
  9      1.21 (66%)     1.20 (67%)

(b) Results for n = 343 with priors based on models obtained on problem instances of smaller size, n = 216.

  κ      CPU speedup
  1      0.43 ( 1%)
  3      1.05 (61%)
  5      1.33 (85%)
  7      1.34 (82%)
  9      1.26 (75%)

(c) Results for a combination of distance-based bias (DBB) and sporadic model building (SMB) for n = 343.
10-fold crossvalidation was used.

         CPU speedup
  κ      DBB+SMB        SMB
  1      1.85 (99%)     3.20 (99%)
  3      3.29 (99%)     3.20 (99%)
  5      4.04 (99%)     3.20 (99%)
  7      4.23 (99%)     3.20 (99%)
  9      4.03 (99%)     3.20 (99%)
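
The "nearly multiplicative" claim can be checked directly against the reported averages. The short Python sketch below compares the product of the two individual speedups with the observed combined speedup for two settings at the strongest bias (κ = 9), using values read from Tables 1 and 2; the choice of these two settings is made for illustration.

```python
# (setting, DBB alone, SMB alone, DBB+SMB), all at kappa = 9 and taken from the tables
reported = [
    ("3D spin glass, n = 343", 1.20, 3.20, 4.03),
    ("minimum vertex cover, c = 4, n = 200", 1.91, 4.54, 6.60),
]

ratios = {}
for setting, dbb, smb, combined in reported:
    product = dbb * smb                      # speedup expected if the effects multiply
    ratios[setting] = combined / product     # 1.0 would be exactly multiplicative
    print(f"{setting}: product {product:.2f}, observed {combined:.2f}, "
          f"ratio {ratios[setting]:.2f}")
```

For these two settings the observed combined speedup is roughly 105% and 76% of the product of the individual speedups, respectively, so the combination is close to multiplicative in the first case and somewhat below it in the second.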

5 Summary and Conclusions 

This paper extended the prior work on efficiency enhancement of the hierarchical Bayesian
optimization algorithm (hBOA) using a distance-based bias derived from prior hBOA runs [12]. The
paper demonstrated that (1) the distance-based bias yields substantial speedups on several previ- 
ously untested classes of challenging, NP-complete problems, (2) the approach is applicable even 
when prior runs were executed on problem instances of different size, and (3) the approach can 
yield nearly multiplicative speedups when combined with other efficiency enhancement techniques. 
In summary, the results presented in this paper together with the prior work [12] provide clear 
evidence that learning from experience using a distance-based bias has a great potential to improve 
efficiency of hBOA in particular and estimation of distribution algorithms (EDAs) in general. 

Several topics are of central importance for future work. The approach should be adapted to 
other model-directed optimization techniques, including other EDAs and genetic algorithms with 
linkage learning. The approach should also be modified to introduce bias on problems that cannot be
formulated using an additive decomposition in a straightforward manner or for which such a
decomposition is not practical. Finally, it is important to study the limitations of the proposed approach, and create
theoretical models to automatically tune the strength of the bias and predict expected speedups. 
Table 2: Results for minimum vertex cover.

(a) Results for 10-fold crossvalidation with priors from other instances of the same size.

c = 2
         CPU speedup
  κ      n = 150        n = 200
  1      0.57 ( 2%)     0.45 ( 0%)
  3      1.95 (91%)     1.63 (87%)
  5      2.78 (96%)     2.69 (94%)
  7      3.04 (95%)     2.98 (94%)
  9      3.10 (93%)     2.95 (92%)

c = 4
         CPU speedup
  κ      n = 150        n = 200
  1      0.28 ( 0%)     0.17 ( 0%)
  3      0.97 (39%)     0.53 ( 4%)
  5      1.56 (82%)     1.16 (62%)
  7      1.97 (88%)     1.65 (81%)
  9      2.27 (89%)     1.91 (85%)

(b) Results for n = 200 with priors based on models obtained on problem instances of smaller size, n = 150.

c = 2
  κ      CPU speedup
  1      0.53 ( 2%)
  3      1.95 (91%)
  5      2.79 (95%)
  7      2.99 (94%)
  9      3.02 (91%)

c = 4
  κ      CPU speedup
  1      0.23 ( 0%)
  3      0.86 (27%)
  5      1.50 (79%)
  7      1.89 (85%)
  9      2.12 (84%)

(c) Results for a combination of distance-based bias (DBB) and sporadic model building (SMB) for n = 200.
10-fold crossvalidation was used.

c = 2
         CPU speedup
  κ      DBB+SMB         SMB
  1       3.12 ( 99%)    4.89
  3       6.89 (100%)    4.89
  5      10.25 (100%)    4.89
  7      11.38 (100%)    4.89
  9      11.29 ( 99%)    4.89

c = 4
         CPU speedup
  κ      DBB+SMB         SMB
  1      1.88 ( 82%)     4.54
  3      3.24 ( 96%)     4.54
  5      5.00 ( 99%)     4.54
  7      6.15 ( 99%)     4.54
  9      6.60 ( 99%)     4.54

Table 3: Results for MAXSAT.

         CPU speedup
  κ      p = 2^{-1}     p = 2^{-2}     p = 2^{-4}     p = 2^{-8}
  1      0.13 ( 0%)     0.22 ( 0%)     0.22 ( 0%)     0.38 (  0%)
  3      0.41 ( 0%)     0.53 ( 0%)     0.48 ( 0%)     1.01 ( 49%)
  5      0.81 (25%)     0.82 (18%)     0.74 ( 4%)     1.63 (100%)
  7      1.38 (69%)     1.09 (55%)     1.03 (54%)     1.84 (100%)
  9      2.31 (94%)     1.38 (81%)     1.28 (89%)     1.90 (100%)